Huilin Zhou (hz2507)
Zhuo Li (zl2637)
Qing Xu (qx2178)
Zhaoyu Liu (zl2638)
Many of us will find a job in recent years. In New York City, what type of job can be payed for higher salary? If that’s the ideal job for me , what skills should I have before applying for it? Also where should I rent an apartment if I want to live close to that type of positions? By looking at the NYC jobs data set from 2013 to 2017,we wish to give some advice on job catrgory, required skills and salary for people who are seeking for a job.
What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?
Source: https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t/data
Scrape:We doWnloaded the data set from the website and the original data set contains 3174 job information. We selected data with information about work location, job category, preferred skills and full/part time. We want to use Google Map to get longtitute and latitute from work locations. Since the Google map has a limit of 2500 observation one time, we selected first 2500 observations from the data set.
Cleaning:
1. Merged job category if they are the same kind but just have different names. Finally we got 12 kinds of job categories.
2. Unified salary unit to “Annual” and recalculated the salary range and average salary.
3. Used Google Map to change location to longitute and latitute.
#clean data
df<- read.csv("NYC_Jobs.csv") %>%
clean_names() %>%
filter(job_category!= " ",
full_time_part_time_indicator!=" ",
preferred_skills!= " ",
work_location!=" ") %>%
head(2500) #google map limited to 2500
df$job_category<- as.character(df$job_category)
#combine job category (function?)
for (i in 1:2500) {
df$job_category[[i]][str_detect(df$job_category[[i]],regex("engineering", ignore_case = TRUE))] <- "Engineering"
df$job_category[[i]][str_detect(df$job_category[[i]],regex("finance", ignore_case = TRUE))] <- "Finance"
df$job_category[[i]][str_detect(df$job_category[[i]],regex("legal", ignore_case = TRUE))] <- "Legal Affairs"
df$job_category[[i]][str_detect(df$job_category[[i]],regex("technology", ignore_case = TRUE))] <- "Technology"
df$job_category[[i]][str_detect(df$job_category[[i]],regex("policy", ignore_case = TRUE))] <- "Policy"
df$job_category[[i]][str_detect(df$job_category[[i]],regex("communi", ignore_case = TRUE))] <- "Community"
df$job_category[[i]][str_detect(df$job_category[[i]],regex("clerical", ignore_case = TRUE))] <- "Clerical"
df$job_category[[i]][str_detect(df$job_category[[i]],regex("social service", ignore_case = TRUE))] <- "Social service"
df$job_category[[i]][str_detect(df$job_category[[i]],regex("public safety", ignore_case = TRUE))] <- "Public safety"
df$job_category[[i]][str_detect(df$job_category[[i]],regex("administration", ignore_case = TRUE))] <- "Administration"
df$job_category[[i]][str_detect(df$job_category[[i]],regex("operations", ignore_case = TRUE))] <- "Maintance"
df$job_category[[i]][str_detect(df$job_category[[i]],regex("maintenance ", ignore_case = TRUE))] <- "Maintance"
}
# change unit of salary and calculate mean
df$salary_frequency<-as.character((df$salary_frequency))
df$salary_range_from<- as.numeric(df$salary_range_from)
df$salary_range_to<- as.numeric(df$salary_range_to)
df$salary_range_from <- with(df, ifelse(salary_frequency == "Hourly", salary_range_from*1825,salary_range_from))
df$salary_range_to <- with(df, ifelse(salary_frequency == "Hourly", salary_range_to*1825,salary_range_to))
df$salary_range_from <- with(df, ifelse(salary_frequency == "Daily", salary_range_from*365*5/7,salary_range_from))
df$salary_range_to <- with(df, ifelse(salary_frequency == "Daily", salary_range_to*365*5/7,salary_range_to))
df<-mutate(df,salary_frequency="Annual") %>%
mutate(salary_mean = (salary_range_to-salary_range_from)/2+salary_range_from)
df$salary_frequency<-as.character((df$salary_frequency))
df$salary_range_from<- as.numeric(df$salary_range_from)
df$salary_range_to<- as.numeric(df$salary_range_to)
#function?
df$salary_range_from <- with(df, ifelse(salary_frequency == "Hourly", salary_range_from*1825,salary_range_from))
df$salary_range_to <- with(df, ifelse(salary_frequency == "Hourly", salary_range_to*1825,salary_range_to))
df$salary_range_from <- with(df, ifelse(salary_frequency == "Daily", salary_range_from*365*5/7,salary_range_from))
df$salary_range_to <- with(df, ifelse(salary_frequency == "Daily", salary_range_to*365*5/7,salary_range_to))
df<-mutate(df,salary_frequency="Annual") %>%
mutate(salary_mean = (salary_range_to-salary_range_from)/2+salary_range_from)
# get lon and lat
df$work_location<-as.character(df$work_location)
geo <- geocode(location = df$work_location, output="latlon", source="google")
df<- df %>%
mutate(lon = geo$lon, lat = geo$lat) %>%
group_by(job_category)
df <- df %>%
mutate(text_label = str_c("Annual mean salary$:",salary_mean))
write.csv(df,file = "job.csv",row.names=FALSE)
First, we use longtitue and latitute of 2500 observations and plot their location in the map by different job category and text with average annual salary.
# plotly worldwild
Sys.setenv('MAPBOX_TOKEN' = 'pk.eyJ1IjoieHVxaW5nc2FsbHkiLCJhIjoiY2phZWh0djdyMHUzZTJ3bGR3MHFsdmIzZSJ9.vhYtu7zeAAuX6slhdDj6lA')
nyc_jobs=read_csv("job.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## job_id = col_integer(),
## x_of_positions = col_integer(),
## salary_range_from = col_double(),
## salary_range_to = col_double(),
## posting_date = col_datetime(format = ""),
## post_until = col_datetime(format = ""),
## posting_updated = col_datetime(format = ""),
## process_date = col_datetime(format = ""),
## salary_mean = col_double(),
## lon = col_double(),
## lat = col_double()
## )
## See spec(...) for full column specifications.
plot_mapbox(nyc_jobs,lat = ~lat, lon = ~lon,
size=2,
split = nyc_jobs$job_category,
mode = 'scattermapbox') %>%
add_markers(
text = ~text_label,
color = ~job_category,size = I(9)) %>%
layout(title = 'Work Location',
font = list(color='white'),
plot_bgcolor = '#191A1A', paper_bgcolor = '#191A1A',
mapbox = list(style = 'dark'),
legend = list(orientation = 'h',
font = list(size = 8)),
margin = list(l = 25, r = 25,
b = 25, t = 25,
pad = 2))
## Warning: Ignoring 81 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
#alternative map
#For location in NYC(2500-611)
myLocation_nyc = c(-74.23,40.58,-73.75,40.87)
myMap_nyc <- get_map(location=myLocation_nyc,
source = "stamen",maptype = "watercolor", crop = FALSE)
## Map from URL : http://tile.stamen.com/watercolor/11/601/768.jpg
## Map from URL : http://tile.stamen.com/watercolor/11/602/768.jpg
## Map from URL : http://tile.stamen.com/watercolor/11/603/768.jpg
## Map from URL : http://tile.stamen.com/watercolor/11/604/768.jpg
## Map from URL : http://tile.stamen.com/watercolor/11/601/769.jpg
## Map from URL : http://tile.stamen.com/watercolor/11/602/769.jpg
## Map from URL : http://tile.stamen.com/watercolor/11/603/769.jpg
## Map from URL : http://tile.stamen.com/watercolor/11/604/769.jpg
## Map from URL : http://tile.stamen.com/watercolor/11/601/770.jpg
## Map from URL : http://tile.stamen.com/watercolor/11/602/770.jpg
## Map from URL : http://tile.stamen.com/watercolor/11/603/770.jpg
## Map from URL : http://tile.stamen.com/watercolor/11/604/770.jpg
## Map from URL : http://tile.stamen.com/watercolor/11/601/771.jpg
## Map from URL : http://tile.stamen.com/watercolor/11/602/771.jpg
## Map from URL : http://tile.stamen.com/watercolor/11/603/771.jpg
## Map from URL : http://tile.stamen.com/watercolor/11/604/771.jpg
ggmap(myMap_nyc)+
geom_point(aes(x = lon, y = lat), data = nyc_jobs,
alpha = .5, color="darkred",size = 3) +
labs(title = "Work location in NYC")
## Warning: Removed 409 rows containing missing values (geom_point).
We draw the distribution of average annual salary of different job categories.
nyc_jobs %>%
plot_ly(y = ~salary_mean, color = ~job_category, type = "box",
colors = "Set2") %>%
layout(title = "Annual average salary distribution in 12 job categories")
# Job category with different educational requirement and base salary
job_data=nyc_jobs %>%
select(job_category, salary_range_from, salary_range_to,minimum_qual_requirements,full_time_part_time_indicator,salary_frequency) %>%
filter(job_category!= " ", minimum_qual_requirements!=" ",full_time_part_time_indicator =="F",salary_frequency=="Annual")
x=c("baccalaureate", "Bachelor")
y=c("Master","master")
master_data=filter(job_data,grepl(paste(y, collapse = "|"),minimum_qual_requirements),!grepl(paste(x, collapse = "|"),minimum_qual_requirements))
baccalaureate_data = filter(job_data,grepl(paste(x, collapse = "|"),minimum_qual_requirements),!grepl(paste(y, collapse = "|"),minimum_qual_requirements))
Other_data=filter(job_data,!grepl(paste(y, collapse = "|"),minimum_qual_requirements),!grepl(paste(x, collapse = "|"),minimum_qual_requirements))
plot_ly(master_data, y = ~salary_range_from, color = ~job_category, type = "box", colors = "Set2") %>%
layout(title = "Base salary of jobs required at least master degree")
plot_ly(baccalaureate_data, y = ~salary_range_from, color = ~job_category, type = "box", colors = "Set2") %>%
layout(title = "Base salary of jobs required baccalaureate degree(No need master)")
plot_ly(Other_data, y = ~salary_range_from, color = ~job_category, type = "box",
colors = "Set2") %>%
layout(title = "Base salary of different kind of jobs without requirement of degree")
For those three plots, we see that in the boxplot of base salary of jobs required for at least master degree, engineering has the highest median base salary and the overall base salary of Legal Affairs are pretty high compared with other job category. While in the boxplot of base salary of jobs required for baccalaureate technology has the highest median base salary, and the overall salary of Administration is pretty high compared with others. In the boxplot of base salary without requirement of degree, the median base salary and the overall base salary of Engineering are all pretty high compared with other job categories.
Comparing those three plots, we can see the overall base salary of jobs required for at least master degree are higher than the overall base salary of jobs required for baccalaureate. The distribution of the base salary of jobs without degrees’ requirement is not centralized and the base salaries mainly depend on the job positon in this case.
job_data=mutate(job_data, salary_range = salary_range_to - salary_range_from)
plot_ly(job_data, y = ~salary_range,x = ~job_category, type = "bar") %>%
layout(title = "The wage increasing ranges of different kinds of jobs")
From the plot, we can see that the job category of Legal Affairs has the largest wage increasing range and the job category of Clerical has the smallest wage increasing range compared with other different job categories.
job_positions = nyc_jobs %>%
select(x_of_positions, agency, job_category, salary_mean) %>%
distinct() %>%
group_by(agency, job_category) %>%
summarise(positions = sum(x_of_positions)) %>%
arrange(desc(positions))
# Number of job positions: top 10
knitr::kable(head(job_positions, 10))
| agency | job_category | positions |
|---|---|---|
| DEPT OF HEALTH/MENTAL HYGIENE | Health | 382 |
| ADMIN FOR CHILDREN’S SVCS | Social service | 100 |
| DEPT OF ENVIRONMENT PROTECTION | Engineering | 82 |
| DEPARTMENT OF BUILDINGS | Public safety | 71 |
| DEPARTMENT OF PROBATION | Public safety | 65 |
| DEPARTMENT OF TRANSPORTATION | Engineering | 65 |
| DEPARTMENT OF TRANSPORTATION | Maintance | 60 |
| LAW DEPARTMENT | Legal Affairs | 56 |
| NYC HOUSING AUTHORITY | Maintance | 56 |
| DEPT OF HEALTH/MENTAL HYGIENE | Community | 55 |
The table above shows the top-10 number of job positions in NYC from 2011 to 2017. DEPT OF HEALTH/MENTAL HYGIENE with job category “Health” has the most job positions.
positions_plot = nyc_jobs %>%
select(x_of_positions, agency, job_category, salary_mean, posting_date) %>%
distinct() %>%
separate(posting_date, into = c("year", "month", "day"), sep = "-") %>%
select(-month, -day) %>%
group_by(job_category, year) %>%
summarise(positions = sum(x_of_positions))
plot_ly(positions_plot, x = ~job_category, y = ~positions,
color = ~year, type = "bar") %>%
layout(title = "Number of positions of job categories in each year",
barmode = "group")
The bar chart above shows the number of positions for each job category in each year. From 2013 to 2017, the number of job positions in each category keeps increasing. And all categories have a dramatically increase in job positions in 2017. This might because more and more companies were founded and developed in 2017, thus they need more employees joining in. Besides, since this dataset contains all job information from NYC official job website, as the year increases, more people found this website and created job postings on this site.
nyc_jobs = nyc_jobs%>%
ungroup()%>%
mutate( minimum_qual_requirements = as.character(minimum_qual_requirements))%>%
mutate(preferred_skills = as.character(preferred_skills))
jobs_words_skill = nyc_jobs%>%
unnest_tokens(word,preferred_skills)%>%
anti_join(stop_words)%>%
inner_join(., parts_of_speech) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## Joining, by = "word"
jobs_words_requirement = nyc_jobs%>%
unnest_tokens(word,minimum_qual_requirements)%>%
anti_join(stop_words)%>%
inner_join(., parts_of_speech) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## Joining, by = "word"
jobs_words_skill %>%
top_n(20) %>%
mutate(word = fct_reorder(word, n)) %>%
plot_ly(y = ~word, x = ~n, color = ~word, type = "bar")%>%
layout(title = "Preferred Skills Word Counts")
## Selecting by n
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
jobs_words_requirement %>%
top_n(20) %>%
mutate(word = fct_reorder(word, n)) %>%
plot_ly(y = ~word, x = ~n, color = ~word, type = "bar")%>%
layout(title = "Minimum Qual Requirements")
## Selecting by n
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
The word count plots of prefferred skills and minimum qual requirements. The word count plots of prefferred skills and minimum qual requirements. After deleting meaningless words, experience is the most used word both in prefeered skills and minimum requirement. We could see that word and excel is still the most common skill needed, and other skills for communication like written, verbal public speaking and oral is also like basic skills needed. For miminum qual requirements, education is really important, for that there are several words related to education shown on the plot such like “school”, “college”, “degree”, “education”, “graduate”. Those words indicates a high requirement of education like bacherlor or master degree. And the word york indicates the requirement of residency in New York. For different knids of job, the skill and minimum requirement may change. The result of this part will be shown on the Shinny Website. The word clouds of skills and minimum requirement are shown below, which indicates a more straight view of word frequency.
set.seed(123)
wordcloud2(jobs_words_skill, size = 2,color = 'random-light',
backgroundColor = "gray", fontWeight='bold',
minRotation = -pi/3, maxRotation = pi/3,rotateRatio = 0.8)
## Warning in if (class(data) == "table") {: the condition has length > 1 and
## only the first element will be used
wordcloud2(jobs_words_requirement, size = 2,color = 'random-light',
backgroundColor = "gray", fontWeight='bold',
minRotation = -pi/3, maxRotation = pi/3,rotateRatio = 0.8)
## Warning in if (class(data) == "table") {: the condition has length > 1 and
## only the first element will be used